20180608 A More User-Friendly Way to Recommend Albums

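In the earlier code we learned how to filter albums by rating and style.
Now I have a more user-friendly way to do this:
enter an artist you like, and three albums are recommended at random.
In terms of code, no new techniques are involved.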

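The process is roughly as follows:
1. Scrape the artist's URL
2. Scrape the artist's music styles
3. Build the music-style dictionary
4. Convert the styles obtained earlier into the headers and data needed for the later POST request
5. Use that data (styles and ratings) to scrape all matching albums and add them to a list
6. Randomly pick 3 albums from the list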

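Running all of these steps makes the whole program quite slow.
Later on I might add things like importing data from a file or asynchronous requests (asyncio) to build a more advanced crawler.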
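
As a rough idea of what the asynchronous approach could look like, here is a minimal sketch using asyncio. It is not part of the program in this post: `fetch_page` is a hypothetical stand-in that only sleeps and returns placeholder strings, where a real crawler would await an HTTP request.

```python
import asyncio

async def fetch_page(page_number):
    # Hypothetical stand-in for one result-page request;
    # a real crawler would await an HTTP call here.
    await asyncio.sleep(0.01)
    return [f"album-{page_number}-{i}" for i in range(3)]

async def fetch_all(page_numbers):
    # Start every page request at once and wait for all of them.
    # asyncio.gather returns results in the order the awaitables were passed.
    pages = await asyncio.gather(*(fetch_page(n) for n in page_numbers))
    return [album for page in pages for album in page]

def collect_albums(page_numbers):
    # Synchronous entry point so the rest of a script can stay unchanged.
    return asyncio.run(fetch_all(page_numbers))
```

The pages are requested concurrently, so the total wait is close to that of the slowest page rather than the sum of all pages.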

1. Scrape the artist's URL

# Import modules
from seleniumrequests import Chrome 
import requests
import re
from bs4 import BeautifulSoup
import random

# Enter the artist to look up and pass the keyword straight to AllMusic's artist search
artist_input = input('type artist you like: ')
search_url = "https://www.allmusic.com/search/artists/" + artist_input

# Next, request search_url and extract the matching artist's URL, using seleniumrequests
chrome_path = r"C:\Users\Ramone\seleniumdriver\chrome\chromedriver.exe" # local path to the browser driver
webdriver = Chrome(chrome_path) # use Chrome as the webdriver
search_res = webdriver.request('GET',search_url)
search_soup = BeautifulSoup(search_res.text,'lxml')
search_source = search_soup.find('div',{'class':'name'})
artist_url = re.search(re.compile(r'https://www.allmusic.com/artist/.*(?<=\d)'),str(search_source)).group()
print (artist_url) 

###output###
#https://www.allmusic.com/artist/the-beach-boys-mn0000041874
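
Two details in this step can go wrong: artist names containing spaces or symbols, and a search that returns no match (in which case `.group()` is called on `None`). The sketch below shows one way to handle both. The helper names are mine, and the `-mn` plus digits suffix in the pattern is an assumption based only on the sample URL above.

```python
import re
from urllib.parse import quote

# Assumption: artist URLs end in "-mn" followed by digits, as in the sample output.
ARTIST_URL = re.compile(r'https://www\.allmusic\.com/artist/[\w-]+-mn\d+')

def build_search_url(artist_name):
    # Percent-encode the name so spaces, "/" and "&" stay inside the path segment.
    return "https://www.allmusic.com/search/artists/" + quote(artist_name, safe="")

def extract_artist_url(html):
    # Return the first artist URL found in the HTML, or None if there is none.
    match = ARTIST_URL.search(html)
    return match.group() if match else None
```

With this, the caller can check for `None` and ask for another artist name before moving on to the next step.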

2. Use the artist URL obtained above (artist_url) to scrape the artist's music styles on AMG

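# Use the developer tools to find the pattern for style links and hand it to BeautifulSoup
webdriver = Chrome(chrome_path) # optional: the driver created above can be reused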
artist_res = webdriver.request('GET',artist_url)
artist_soup = BeautifulSoup(artist_res.text,'lxml')
artist_source = artist_soup.find_all('a',{'href':re.compile(r'https://www.allmusic.com/style/.*')})

# Add the scraped results to a list for later use
style_list=[]
for s in artist_source:
    style_list.append(s.text)
    print (s.text)
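
If the same style happens to be linked more than once on the artist page (I have not checked whether it is), style_list would contain duplicates, which would skew the random choice made later. A small sketch of removing duplicates while keeping the original order; `unique_styles` is a name made up for this example.

```python
def unique_styles(style_names):
    # Strip surrounding whitespace, drop empty names, and remove duplicates.
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    cleaned = (name.strip() for name in style_names)
    return list(dict.fromkeys(name for name in cleaned if name))
```

Usage would be `style_list = unique_styles(style_list)` right after the loop.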

3. Next, scrape AMG's codes for the music styles and build the dictionaries (same code as in the previous post)

webdriver = Chrome(chrome_path) # use Chrome as the webdriver
# Build the label dictionary
label_res= webdriver.request('GET','https://www.allmusic.com/advanced-search/') 
label_soup = BeautifulSoup(label_res.text,"lxml")
label = label_soup.find_all('input',{'id':re.compile('genreid.*?')})
label_dict={}
for l in label:
    label_dict[l['value']]=l['id']

# Build the rating dictionary
rating_dict={}
star=1.0
for i in range(1,10):
    rating_dict[str(star)]='editorialrating:'+str(i)
    star+=0.5

# Print each style with its AMG code
for i in style_list:
    print (i,' : ',label_dict[i])

###output###
#AM Pop  :  subgenreid:MA0000012000
#Early Pop/Rock  :  subgenreid:MA0000002763
#Surf  :  subgenreid:MA0000002883
#Contemporary Pop/Rock  :  subgenreid:MA0000004443
#Sunshine Pop  :  subgenreid:MA0000012028
#Psychedelic Pop  :  subgenreid:MA0000011915
#Rock & Roll  :  subgenreid:MA0000002829
#Psychedelic/Garage  :  subgenreid:MA0000002800
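
The two dictionaries can also be handled a little more defensively. The sketch below builds the same star-to-code mapping in one expression and looks styles up without raising a KeyError when a style has no entry in the label dictionary. Both function names are made up for this example.

```python
def build_rating_dict():
    # Codes 1..9 correspond to 1.0..5.0 stars in half-star steps,
    # e.g. code 9 -> "5.0", code 6 -> "3.5".
    return {str((code + 1) / 2): "editorialrating:%d" % code
            for code in range(1, 10)}

def lookup_styles(style_list, label_dict):
    # Split the styles into those that have an AMG code and those that do not.
    found = {s: label_dict[s] for s in style_list if s in label_dict}
    missing = [s for s in style_list if s not in label_dict]
    return found, missing
```

Styles that end up in `missing` can then be skipped instead of stopping the program.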

4. Build the headers and data needed for the POST.
The headers are given directly; the data is built from the style_list above.

# Fixed AMG headers
amg_header={
'authority': r'www.allmusic.com',
'method': r'POST',
'path': r'/advanced-search/results/',
'scheme': r'https',
'accept': r'text/html, */*; q=0.01',
'accept-encoding': r'gzip, deflate, br',
'accept-language': r'zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-CN;q=0.6',
'content-length': '185',
'content-type': r'application/x-www-form-urlencoded; charset=UTF-8',
'cookie': r'_ga=GA1.2.85029673.1513518205; __gads=ID=3275e8321c618a22:T=1513518176:S=ALNI_MbT7eOHrtfYxgOBlXi-4NZzwkA01Q; __qca=P0-704611526-1513518207321; policy=notified; _gid=GA1.2.15574113.1527937083; bm_monthly_unique=true; registration_prompt=true; bm_last_load_status=BLOCKING; advancedSearchLogic=and; allmusic_session=a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%22f55df85c5fd33a3550642ff7f525e829%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A11%3A%2210.128.8.31%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A114%3A%22Mozilla%2F5.0+%28Windows+NT+10.0%3B+Win64%3B+x64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F67.0.3396.62+Safari%2F537.36%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1528081392%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3Bs%3A4%3A%22user%22%3Bi%3A0%3B%7D6e67ed658981abc1f6d25c605bcee246; _gat=1',
'origin': r'https://www.allmusic.com',
'referer': r'https://www.allmusic.com/advanced-search',
'user-agent': r'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
'x-requested-with': r'XMLHttpRequest',
}

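# Build the data from the criteria gathered earlier

# To keep the results broad, pick only 3 styles at random when there are too many (intersecting too many styles leaves too few albums)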
if len(style_list) > 3:
    amg_style = random.sample(style_list,3)
else:
    amg_style = style_list

# Build the music-style part of the POST data
# %26 is the percent-encoded "&"; a raw "&" would be read as a field separator in the form-urlencoded body
amg_label="%26".join(label_dict[i] for i in amg_style)

# Build the rating part of the POST data
amg_rating=rating_dict['5.0']+'|'+rating_dict['4.5']+'|'+rating_dict['4.0']

print (amg_label)
print (amg_rating)
print (amg_style)

###output### (this captured run lists four styles, so it likely predates the 3-style limit above)
#subgenreid:MA0000002800%26subgenreid:MA0000011915%26subgenreid:MA0000002829%26subgenreid:MA0000012028
#editorialrating:9|editorialrating:8|editorialrating:7
#['Psychedelic/Garage', 'Psychedelic Pop', 'Rock & Roll', 'Sunshine Pop']
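
To make the shape of the POST body easier to see, here is a small sketch that assembles it in one function. `build_post_data` is a name made up for this example; the `filters[]` field name and the `|` and `&` separators are taken from the code in this post.

```python
from urllib.parse import quote

def build_post_data(style_codes, rating_codes):
    # Styles are combined with "&" (all must match). The "&" has to be
    # percent-encoded, because a raw "&" separates fields in a
    # form-urlencoded body.
    label = quote("&", safe="").join(style_codes)  # joins with "%26"
    # Ratings are combined with "|" (any of them may match).
    rating = "|".join(rating_codes)
    return "filters[]=%s&filters[]=%s" % (label, rating)
```

Called with two style codes and two rating codes, it produces a body with exactly two `filters[]` fields, one for styles and one for ratings.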

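5. Start scraping albums: POST the data (the filter criteria), collect all album information, and add it to a list named recommend (similar to the code in the previous post)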

# Pass the headers and data to selenium-requests for the POST request
webdriver = Chrome(chrome_path)
res= webdriver.request('POST','https://www.allmusic.com/advanced-search/results/'
                       ,headers=amg_header,data="filters[]=%s&filters[]=%s" %(amg_label,amg_rating))
res_soup = BeautifulSoup(res.text,'lxml')

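# Use the developer tools to find the pagination pattern, which gives the URL for the next response
next_page = res_soup.find_all('span',{'class':'next'})
amg_url = 'https://www.allmusic.com' # no trailing slash: the matched path already starts with "/"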
next_url=re.compile(r'(?=/advanced-search/results/)/advanced-search/results/\d+')
recommend = []

# Function that strips surrounding whitespace (including newlines) and collects the text into a list
def add_list(y):
    y_list=[]
    for x in y:
        x_text=x.text
        x_str=x_text.strip()
        y_list.append(x_str)
    return y_list

# Function that pulls the table cells out with BeautifulSoup, with the results wrapped in a class
class get_info():
    def __init__(self,artist,title,year):
        self.artist=artist
        self.title=title
        self.year=year
def get_source():
    artist = res_soup.find_all('td',{'class':'artist'})
    title = res_soup.find_all('td',{'class':'title'})
    year = res_soup.find_all('td',{'class':'year'})
    return get_info(artist,title,year)

# If there are six or fewer results, relax the rating down to 3.5 stars and run the POST again.
if len(add_list(get_source().title))<7:
    print ('need more low rating album, please wait...')
    amg_rating='editorialrating:9|editorialrating:8|editorialrating:7|editorialrating:6'
    webdriver = Chrome(chrome_path)
    res= webdriver.request('POST','https://www.allmusic.com/advanced-search/results/'
                           ,headers=amg_header,data="filters[]=%s&filters[]=%s" %(amg_label,amg_rating))
    res_soup = BeautifulSoup(res.text,'lxml')

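    # Use the developer tools to find the pagination pattern, which gives the URL for the next response
    next_page = res_soup.find_all('span',{'class':'next'})
    amg_url = 'https://www.allmusic.com' # no trailing slash: the matched path already starts with "/"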
    next_url=re.compile(r'(?=/advanced-search/results/)/advanced-search/results/\d+')

    i=0   
    while i <len(add_list(get_source().title)):
        recommend.append(add_list(get_source().artist)[i]+"-"+add_list(get_source().title)[i]+"-"+add_list(get_source().year)[i])
        i=i+1

# Only one page (and more than six results)
elif next_page == [] and len(add_list(get_source().title))>6:
    print ('loading...only one page')
    i=0   
    while i <len(add_list(get_source().title)):
        recommend.append(add_list(get_source().artist)[i]+"-"+add_list(get_source().title)[i]+"-"+add_list(get_source().year)[i])
        i=i+1

# Multiple pages
else:
    next_res = amg_url+re.search(next_url,str(next_page[-1])).group()
    while next_page != []:
        print ('loading...multi-page' )
        i=0 
        while i <len(add_list(get_source().title)):
            recommend.append(add_list(get_source().artist)[i]+"-"+add_list(get_source().title)[i]+"-"+add_list(get_source().year)[i])
            i=i+1
        res= webdriver.request('POST',next_res,headers=amg_header,data="filters[]=%s&filters[]=%s" %(amg_label,amg_rating))
        res_soup = BeautifulSoup(res.text,'lxml')
        next_page = res_soup.find_all('span',{'class':'next'})
        if next_page != []:
            next_res = amg_url+re.search(next_url,str(next_page[-1])).group()
        else:
            print ('loading...last page' )   
            i=0 
            while i <len(add_list(get_source().title)):
                recommend.append(add_list(get_source().artist)[i]+"-"+add_list(get_source().title)[i]+"-"+add_list(get_source().year)[i])
                i=i+1

# The full list
#for r in recommend:
#    print (r)
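
The same "format the rows, then follow the next link" logic appears in several branches above. As a sketch of how it could be written once, the two functions below separate row formatting from page traversal. This is an alternative arrangement, not the code this post runs: `fetch_page` is a hypothetical callable that returns the rows of one page and the URL of the next page (or None), and the fake pages only hold placeholder values.

```python
def format_rows(artists, titles, years):
    # Join the three parallel columns into "artist-title-year" strings,
    # stripping surrounding whitespace such as newlines from each cell.
    return ["-".join(cell.strip() for cell in row)
            for row in zip(artists, titles, years)]

def collect_all_pages(fetch_page, first_url):
    # Keep requesting pages until a page reports no next URL.
    recommend = []
    url = first_url
    while url is not None:
        rows, url = fetch_page(url)
        recommend.extend(rows)
    return recommend

# Placeholder pages standing in for real responses.
FAKE_PAGES = {
    "/results/": (format_rows([" Artist A "], ["Title 1\n"], ["1966"]), "/results/1"),
    "/results/1": (format_rows(["Artist B"], ["Title 2"], ["1967"]), None),
}

def fake_fetch(url):
    return FAKE_PAGES[url]
```

Here `collect_all_pages(fake_fetch, "/results/")` walks both fake pages and returns their rows in order; a real `fetch_page` would wrap the POST request and the BeautifulSoup parsing.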

6. Randomly pick 3 albums from recommend

for r in random.sample(recommend,min(3,len(recommend))): # avoids a ValueError when fewer than 3 albums were found
    print (r)
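
The final pick can be wrapped in a small function, which also makes it easy to get repeatable output while testing. `pick_recommendations` and its `seed` parameter are made up for this sketch.

```python
import random

def pick_recommendations(albums, count=3, seed=None):
    # random.sample raises ValueError if count exceeds the list length,
    # so cap the sample size at the number of albums available.
    rng = random.Random(seed)
    return rng.sample(albums, min(count, len(albums)))
```

Passing a fixed `seed` returns the same albums on every run; leaving it as None gives a fresh random pick each time.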

Source credit: All Music : https://www.allmusic.com/
No copyright infringement intended.
